Many published research results are false [1], and controversy continues
over the roles of replication and publication policy in improving the
reliability of research. Addressing these problems is frustrated by the
lack of a formal framework that jointly represents hypothesis formation,
replication, publication bias, and variation in research quality. We
develop a mathematical model of scientific discovery that combines all of
these elements. This model provides both a dynamic model of research and
a formal framework for reasoning about the normative structure of
science. We show that replication may serve as a ratchet that gradually
separates true hypotheses from false, but the same factors that make
initial findings unreliable also make replications unreliable. The most
important factors in improving the reliability of research are the rate
of false positives and the base rate of true hypotheses, and we offer
suggestions for addressing each. Our results also bring clarity to verbal
debates about the communication of research. Surprisingly, publication
bias is not always an obstacle, but instead may have positive impacts:
suppression of negative novel findings is often beneficial. We also find
that communication of negative replications may aid true discovery even
when attempts to replicate have diminished power. The model speaks
constructively to ongoing debates about the design and conduct of
science, focusing analysis and discussion on precise, internally
consistent models, as well as highlighting the importance of population
dynamics.
Introduction


Imagine two of your close colleagues have just heard about attempts to 
replicate their positive research findings. Colleague A is thrilled that the 
attempt was successful. Colleague B is upset that the attempt was unsuc¬ 
cessful. What is the probability that Colleague A's hypothesis is true? What 
is the probability that Colleague B's hypothesis is false? 

This is not a fair quiz, because in truth no one knows the answers to these
questions. The absence of replication in many fields [2-4], combined with
the absence of a formal framework for understanding replication, makes it
difficult to even outline an answer. In the absence of replication, there is
substantial concern that many published findings may be false [1], an
argument with empirical support [5-7]. The history of science buttresses
these observations. A recent catalog lists more false discoveries of
chemical elements than there are real elements in the periodic table [8].
In addition to concerns about replication are concerns about research
practice and publication bias. Without knowing how many studies were
conducted but not published, it is not possible to assign evidential value
to either initial findings or replications. And it is not yet easy to
acquire empirical evidence about these factors, as even the best empirical
studies of publication bias still rely upon researcher self-report [3].

Thus many opinions can be sustained about the evidential value of both
initial findings and replications. As a result, recent controversies over
failed replications demonstrate a lack of consensus on norms for replication
and publication [9-12]. What is the evidential value of replication,
positive or negative? What is the impact of publication bias [13]? If
replication is part of an "invisible hand" [14] that corrects scientific
errors, how much replication is needed? And what are the risks of poorly
designed or interpreted replication attempts [9]? When replication is not
possible or practical, what other measures can be taken to improve the
reliability of research?

These questions remind us that little is understood about the population
dynamics of discovery, replication, and scientific communication. Much
more attention has been given to individual methods of research design and
data analysis. And while it is useful to analyze research methods in
isolation, such calculations are unsatisfying. A lot of research activity is
hidden from the public record. This means the actual number of findings for
an hypothesis may never be known [13]. And since researchers select
hypotheses for further study from the literature itself, findings and
publication biases cascade into other findings, interacting with biases and
incentives [15].

To know the evidential value of research, we must study the population
dynamics that produce it [14, 16-18]. So here we construct and solve a
mathematical model of scientific beliefs formed by a population of
boundedly rational agents who accumulate evidence for and against
hypotheses. We adopt a general signal detection framework that may apply to
diverse statistical paradigms, whether p-valued or Bayesian. We study the
joint dynamics that arise from replication, publication bias, and
differences in research quality between original studies and replications.
Our goal is not to accurately simulate science, but rather to understand it
better using the same reductionist tools that have been so successful in
illuminating population dynamics more generally [19, 20]. Our model
implicitly provides, for example, a neutral model of scientific dynamics in
which all hypotheses are false and yet discoveries are continuously
published. It also provides a range of "selectionist" models that might be
compared to data. The clarity of a quantitative framework will stimulate and
clarify the development of later empirical investigation and experimental
intervention.

The paper proceeds by first outlining the dynamic structure of the model. 
We then solve the model for both its long-run dynamics and its epistemo¬ 
logical implications—what should a rational agent believe about an hy¬ 
pothesis, given a record of published results? We present a general interpre¬ 
tation of the joint dynamics, so the reader can extrapolate lessons from our 
simple model to the complexity and diversity of real science. We conclude 
by relating our results to ongoing debates about improving the reliability 
of scientific research. 


Model Description 

The model is illustrated in Fig. 1. We have also constructed an interactive,
web-based tutorial on the conceptual foundations of the model, as well as
fully adjustable simulation code, available at http://xcelab.net/replication/.
A population of researchers studies many different hypotheses. Each
hypothesis is either true (green) or false (red). These hypotheses could be
simple associations, such as green jelly beans cause acne [21], or more general
claims, such as evolution is predictable. Research results in either a positive or 
a negative finding. These findings may be the result of formal hypothesis 
tests or informal assessments. True hypotheses produce positive findings 
more often than do false hypotheses, but the researchers never know for 
sure which hypotheses are true. Under these assumptions, the only infor¬ 
mation relevant for judging the truth of an hypothesis is its tally, the differ¬ 
ence between the number of published positive findings and the number 
of published negative findings for each hypothesis, and we summarize re¬ 
sults in terms of these tallies. In reality, much other information is relevant 
to judging the truth of an hypothesis. Our assumptions are tactical ones. 
More complex models of scientific communication are possible, but any 
such model must include the components in our model, and so our results 
establish a critical baseline. 






MCELREATH & SMALDINO 
FIGURE 1. Population dynamics of replication.


Each time interval, research activity has three stages that alter these
tallies. In stage 1 (Fig. 1, upper-left) each researcher chooses to
investigate one of $n$ previously published hypotheses, with probability $r$,
or a novel hypothesis, with probability $1 - r$. When replicating, a
researcher chooses a previously published hypothesis at random and performs
a new study of it.
Later, we allow researchers to target hypotheses with specific tally values, 
rather than choosing at random. A novel hypothesis is true with probability 
b, the base rate, reflecting mechanisms of hypothesis formation. Untutored 
intuition, for example, may be expected to yield a very low b. Genome wide 
association studies likewise have low b, because relatively few loci are as¬ 
sociated with any particular phenotype. There is no consensus on base rate, 
except that most scientists we know believe their own personal b values are 
better than average. So we allow b to vary freely in the model. 

In stage 2, a true hypothesis produces a positive finding $1 - \beta$ of the
time, its power. A false hypothesis produces a positive finding $\alpha$ of
the time, its false positive rate. We assume that $1 - \beta > \alpha$. Later
we allow the values of $\beta$ and $\alpha$ to differ between replication
attempts and initial studies. Note that $\beta$ and $\alpha$ are not merely
properties of a statistical procedure, but rather of an entire
investigation. For example, using several procedures and selecting the one
that produces a positive result will inflate $\alpha$ [22].
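
As a rough illustration of how selecting among procedures inflates the false-positive rate, the sketch below assumes that each of k procedures is an independent test of a false hypothesis at the same nominal rate. Independence is an assumption made here for simplicity, not something the model requires, and the function name is hypothetical.

```python
def inflated_false_positive_rate(alpha: float, k: int) -> float:
    """Chance that at least one of k independent tests of a FALSE
    hypothesis comes out positive (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** k


# Nominal rate 0.05, trying 1, 3, or 5 procedures and keeping any positive.
for k in (1, 3, 5):
    print(k, round(inflated_false_positive_rate(0.05, k), 4))
```

Correlated procedures would inflate the rate less than this, so the numbers are an upper-end sketch rather than an estimate for any real study.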

In stage 3, findings may be communicated to other researchers. Not every
finding is communicated, either because no one tries to communicate it or
because it cannot be published. Only communicated findings can adjust a
tally. Let $c_{N-}$ be the probability that a negative ($-$) finding about a
new ($N$) hypothesis is communicated. We assume for simplicity that all new
positive results are communicated ($c_{N+} = 1$). Even though replication
findings are evidentially equivalent to novel findings, they may be
communicated with different probability. Let $c_{R-}$ and $c_{R+}$ be the
probabilities that replications with negative and positive findings,
respectively, are communicated.
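
The second and third stages can be sketched as a single function that draws a finding and then decides whether it is communicated. This is a minimal illustration of the verbal description above, not the authors' simulation code; the function and argument names are hypothetical.

```python
import random


def investigate(is_true, is_replication, power, alpha,
                c_new_neg, c_rep_neg, c_rep_pos, rng):
    """Return the change to a hypothesis's tally from one study:
    +1 or -1 if the finding is communicated, 0 if it is not."""
    # Stage 2: true hypotheses yield positives at rate `power`,
    # false hypotheses at rate `alpha`.
    positive = rng.random() < (power if is_true else alpha)
    # Stage 3: communication probability depends on the kind of finding.
    if positive:
        p_comm = c_rep_pos if is_replication else 1.0  # novel positives always
    else:
        p_comm = c_rep_neg if is_replication else c_new_neg
    if rng.random() < p_comm:
        return 1 if positive else -1
    return 0


rng = random.Random(1)
draws = [investigate(True, True, 0.8, 0.05, 0.0, 1.0, 1.0, rng)
         for _ in range(10)]
print(draws)
```

Returning 0 for an uncommunicated finding reflects the statement that only communicated findings can adjust a tally.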

These assumptions define the dynamics of the expected numbers of true
and false hypotheses with a given tally. We present the full recursions in
the Supporting Material. In the simplest case (full communication:
$c_{N-} = c_{R-} = c_{R+} = 1$), the number $n'_{T,s}$ of true hypotheses
with an observed tally $s$ in the next time step is given by:

$$
n'_{T,s} = n_{T,s} + a n r \left( -\frac{n_{T,s}}{n}
  + \frac{n_{T,s-1}}{n}\,(1 - \beta)
  + \frac{n_{T,s+1}}{n}\,\beta \right)
\tag{1}
$$


where $a > 0$ is the rate of research activity as a proportion of $n$. This
expression says that the number in the next time step is just the current
number plus all of the flows in and out caused by replications. In the case
that $s = -1$ or $s = 1$, there is an additional term $an(1 - r)b\beta$ or
$an(1 - r)b(1 - \beta)$, respectively, to represent the inflow of novel
findings. Recursions $n'_{F,s}$ for false hypotheses are constructed from a
change in variables: $1 - \beta \to \alpha$, $b \to 1 - b$. Notice that this
implies that the model is easily extended to any number of hypothesis types,
such as effect size differences, that differ in power and false-positive
rate. We analyze the true/false dichotomy because of its prominence and
simplicity.
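
The recursion can be iterated numerically. The sketch below assumes full communication and takes the recursion to be exactly what the text describes: the current count, plus replication flows in from the neighbouring tallies and out of the current one, plus novel findings entering at tallies +1 and -1. The function names, the starting state, and the parameter defaults are illustrative choices, not values prescribed by the paper.

```python
def step(n_true, n_false, a, r, b, power, alpha):
    """Advance expected counts by one time step (full communication).

    n_true and n_false map a tally s to the expected number of true or
    false hypotheses currently at that tally.
    """
    n = sum(n_true.values()) + sum(n_false.values())

    def update(counts, p_pos, share):
        tallies = {-1, 1}
        for s in counts:
            tallies.update((s - 1, s, s + 1))
        new = {}
        for s in tallies:
            here = counts.get(s, 0.0)
            # Replications move hypotheses up from s-1 and down from s+1.
            inflow = (p_pos * counts.get(s - 1, 0.0)
                      + (1 - p_pos) * counts.get(s + 1, 0.0))
            new[s] = here + a * r * (inflow - here)
        # Novel hypotheses enter at +1 (positive) or -1 (negative).
        novel = a * n * (1 - r) * share
        new[1] += novel * p_pos
        new[-1] += novel * (1 - p_pos)
        return new

    return update(n_true, power, b), update(n_false, alpha, 1 - b)


def run(steps=400, a=0.1, r=0.2, b=0.1, power=0.8, alpha=0.05):
    n_true, n_false = {1: 1.0}, {1: 1.0}  # arbitrary starting state
    for _ in range(steps):
        n_true, n_false = step(n_true, n_false, a, r, b, power, alpha)
    return n_true, n_false


n_true, n_false = run()
total = sum(n_true.values()) + sum(n_false.values())
print("share of true hypotheses:", sum(n_true.values()) / total)
print("precision at s = 1:", n_true[1] / (n_true[1] + n_false[1]))
```

Because the counts grow every step, only the proportions settle down; the share of true hypotheses should approach the base rate b regardless of the starting state.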


Analysis 

By literature review, a tally can be constructed for any given hypothe¬ 
sis. Given an observed tally, but a number of possibly unobserved studies, 
what is the probability that an hypothesis is correct? The model allows us 
to address this question for a diversity of scenarios. Before presenting the 
solutions, note that the answers that the model provides can be understood 
both from a pure population dynamics perspective and from a probabilistic 
reasoning perspective. From the dynamics perspective, the population will 
converge from any initial condition to a unique steady state in which the so¬ 
lutions give frequencies of true hypotheses at each tally value. Equally valid 
is the epistemological perspective that the solutions tell us for any unique 
hypothesis the probability it is true, given a state of information [23]. One
consequence of this is that the solutions do not require that all hypotheses
share the same parameter values.

For each tally value $s$, we solved for the steady state proportions of true
and false hypotheses, $p_{T,s}$ and $p_{F,s}$. We also derived the same
solutions under the probabilistic interpretation, and verified our solutions
numerically and through stochastic simulation. We present complete
analytical solutions in the Supporting Material. In the simplest case (for
full communication), solutions take the form:

$$
p_{T,s} = b(1 - r) \sum_{m=1}^{\infty} r^{m-1}
  \binom{m}{\frac{m+s}{2}} (1 - \beta)^{\frac{m+s}{2}} \beta^{\frac{m-s}{2}}
\tag{2}
$$

This expression defines an infinite geometric series of binomial
probabilities arising from all of the different possible histories by which
a true hypothesis could achieve a tally of $s$, for every possible number of
findings $m$. In the majority of cases, only the first few terms of the
series are important, because of the leading factor $r^{m-1}$. This fact
also informs us that the rate of convergence to steady state will be quite
rapid, unless $r$ is large.
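
A truncated version of this series is easy to evaluate. The sketch below assumes the full-communication form described above: a geometric weight in r over the number of findings m, times the binomial probability of reaching tally s after m findings. Terms where m + s is odd contribute nothing, because such a tally cannot be reached. The function name and the truncation point `max_m` are choices made here for illustration.

```python
from math import comb


def steady_state(s, base, r, p_pos, max_m=200):
    """Truncated series for the steady-state proportion at tally s.

    base  : b for true hypotheses, 1 - b for false ones
    p_pos : 1 - beta for true hypotheses, alpha for false ones
    """
    total = 0.0
    for m in range(max(1, abs(s)), max_m + 1):
        if (m + s) % 2:          # tally s is unreachable with m findings
            continue
        k = (m + s) // 2         # number of positive findings
        total += r ** (m - 1) * comb(m, k) * p_pos ** k * (1 - p_pos) ** (m - k)
    return base * (1 - r) * total


b, r, power, alpha = 0.1, 0.2, 0.8, 0.05
for s in (1, 2, 3):
    p_t = steady_state(s, b, r, power)
    p_f = steady_state(s, 1 - b, r, alpha)
    print(s, round(p_t / (p_t + p_f), 3))
```

With small r the first term dominates, which is the point made about the leading factor; for r close to 1, `max_m` would need to be raised.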

For any particular tally, for example $s = 1$, expression (2) yields a
closed-form solution like:

(3)

For arbitrary communication parameters, the solutions have a similar struc¬ 
ture, but are instead a series of multinomial probabilities in which the events 
are combinations of findings (+ or —) and communication outcomes. 

These solutions are not easy to interpret by inspection. But they do pro¬ 
vide answers to the question: what is the probability that an hypothesis with a 
given tally is correct? For any tally s, we can calculate: 

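$$
\Pr(\text{true} \mid s) = \frac{p_{T,s}}{p_{T,s} + p_{F,s}}, \qquad
\Pr(s \mid \text{true}) = \frac{p_{T,s}}{\sum_i p_{T,i}}, \qquad
\Pr(s \mid \text{false}) = \frac{p_{F,s}}{\sum_i p_{F,i}}
\tag{4}
$$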

The precision of a tally s is Pr(true|s), the proportion of hypotheses with 
tally s that are true. The sensitivity, Pr(s|true), is the proportion of true hy¬ 
potheses with tally s. It indicates where the true hypotheses are. Sensitivity 
is important because a high precision for a tally s is little help when there 
are few hypotheses that achieve a tally s. And the specificity, Pr(s|false), is 
the proportion of false hypotheses with tally s, indicating where the false 
hypotheses are. We use these definitions to explain the behavior of the sys¬ 
tem. 
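
These three quantities can be computed directly from any pair of tally distributions. The sketch below follows the definitions given above; the function name and the toy proportions are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def summarize(p_true, p_false, s):
    """Return (precision, sensitivity, specificity) at tally s.

    p_true and p_false map tally -> proportion of all hypotheses that are
    true (or false) and sit at that tally.
    """
    pt = p_true.get(s, 0.0)
    pf = p_false.get(s, 0.0)
    precision = pt / (pt + pf) if pt + pf > 0 else float("nan")
    sensitivity = pt / sum(p_true.values())    # share of true ones at s
    specificity = pf / sum(p_false.values())   # share of false ones at s
    return precision, sensitivity, specificity


# Toy proportions, not model output.
p_true = {1: 0.06, 2: 0.04}
p_false = {-1: 0.60, 1: 0.30}
print(summarize(p_true, p_false, 1))
```

In the toy data, tally 2 has perfect precision but holds only 40% of the true hypotheses, which is the trade-off between precision and sensitivity that the text describes.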

Overall dynamics. Fig. 2 describes the overall dynamics of precision, as a
function of the different parameters. In each panel, the trend lines show
the proportion of true hypotheses at each tally on the vertical axis. The
tally corresponding to each trend is indicated by a number. The horizontal
axis in each panel varies a single parameter. Each vertical hairline shows
the value of each parameter that is held constant in other panels. This
figure is complex. We'll use it to highlight the most important factors in
the reliability of findings and demonstrate counter-intuitive aspects of
communication. Then in the next section, we'll turn to a more general
explanation of the causes of these results.

There are two clusters of plots. The top cluster represents a normatively
optimistic scenario, with an auspicious base rate (b = 0.1), unusually high
power ($1 - \beta = 0.8$), low false-positive rate ($\alpha = 0.05$), and
high communication rates. The bottom cluster represents a pessimistic, or
perhaps more realistic [24, 25], scenario with low base rate (b = 1/1000),
lower power ($1 - \beta = 0.6$), higher false-positive rate
($\alpha = 0.1$), and publication bias resulting in low communication of
replications and negative findings. The range of base rates we show
represents everything from genome wide association studies, on the low end
($b < 10^{-4}$), to predicting the winner of a presidential election, on the
high end (b = 0.5). Every scientist will have a different opinion about
which values represent realism. So in the Supporting Material, we provide a
Mathematica notebook for reproducing and altering these plots, so the reader
can explore alternative scenarios of interest. But keep in mind that
unrealistic scenarios are just as important for comprehending system
dynamics.

First, notice that at tally s = 1 very many research findings are false. In
the top cluster, the base rate must get quite high before a majority of
hypotheses with tally s = 1 are true. In the bottom cluster, only the
highest displayed base rates are sufficient. This dynamically replicates
Ioannidis' direct calculation [1], even in the absence of bias and multiple
testing. Many initially published findings are false, unless the base rate
is high, and without any invocation of fraud or researcher bias.
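
For comparison, the single-study calculation can be written out with Bayes' rule. This sketch is the static calculation for one positive finding, not the steady-state precision of the dynamic model, which also reflects replication flows into and out of tally s = 1. The function name is hypothetical.

```python
def prob_true_given_positive(b: float, power: float, alpha: float) -> float:
    """Bayes' rule for a single positive finding with base rate b."""
    true_pos = b * power
    false_pos = (1.0 - b) * alpha
    return true_pos / (true_pos + false_pos)


# Power 0.8 and false-positive rate 0.05, across several base rates.
for b in (0.001, 0.01, 0.1, 0.5):
    print(b, round(prob_true_given_positive(b, 0.8, 0.05), 3))
```

At b = 0.1 the result is 0.64, and it falls quickly as the base rate drops, with no bias or misconduct involved.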

Second, notice that replication helps, but how much it helps varies greatly. 
In the top cluster, even one positive replication at s = 2 renders most hy¬ 
potheses true, at a base rate of b = 0.1. At lower base rates, s = 3 or s = 4 is 
required to raise precision above one-half. In the bottom cluster, low power 
and high false-positive rate make replication quite inefficient. Even at high 
base rates, s = 3 is needed. At low base rates, s = 5 or more is required. 
In either cluster, achieving near-certainty that an hypothesis is true always 
requires replication, even with a base rate as high as b = 0.1. In general, 
the same factors that make initial findings unreliable also make replications 
less reliable. 

Note also that the rate of replication, r in panel (b), has remarkably little
impact. This is because replication impacts the rate at which hypotheses
reach different tallies, but not so much the precision at each tally.
Therefore at low replication rates, few hypotheses will ever attain s = 5,
but those that do are almost certainly true. We expand on this point in the
next section.

FIGURE 2. Effects of base rate, replication, power, false-positives,
and communication on the probability that an hypothesis with a
given tally is true. The two clusters illustrate different scenarios.
The blue trends, each labeled with its tally value, show precision
as it varies by the parameter on each horizontal axis. Dashed curves
are tallies of an even number. The vertical hairlines show the
parameter values held constant across panels within the same cluster.

Third, communication of findings, panels (e-g), can either assist discovery
or hinder it. Suppression of negative replications (e) reduces precision.
But suppression of positive replications (f) and novel negative findings (g)
either improves precision or has almost no impact on it. These aspects of
the population dynamics are counter-intuitive, but quite general and
revealing. The next section explains them.

Dynamics of communication. The "file drawer problem" [13] arises when
the failure to publish negative findings distorts the estimated strength of
an association. We consider a related phenomenon by asking how changes in
the communication parameters $c_{N-}$, $c_{R-}$, and $c_{R+}$ alter the
precision, sensitivity, and specificity across tallies. In the process,
we'll have opportunity to explain the joint dynamics of research quality and
communication biases.

In this model, it is rarely best to communicate everything. In the
Supporting Material, we prove for the case of small $b$ (such that
$b^2 \approx 0$) and small $r$ ($r^3 \approx 0$) that $c_{N-} < 1$ will
improve precision when $\alpha < \beta$ (usually satisfied), that
$c_{R-} < 1$ improves precision when $\alpha > 1/2$ (hopefully never
satisfied), and that $c_{R+} < 1$ improves precision whenever
$\beta - \alpha < 1/2$ (often satisfied). So some suppression of novel
negative findings ($c_{N-} < 1$) and positive replications ($c_{R+} < 1$)
can improve the value of replication. At larger $b$ and $r$, the conditions
are more complicated, but the qualitative finding remains intact.

To grasp why suppressing findings might help us learn what is true, 
think of replication as epistemological chromatography. Chromatography is 
a set of techniques for separating substances that are mixed together. For 
example, mixed plant pigments can be separated by painting the mixture 
onto the tip of a strip of filter paper and then soaking the tip in a solvent. 
Different pigments bind more or less strongly to the solvent or the paper. 
Therefore as the paper absorbs the solvent, different pigments travel at dif¬ 
ferent speeds, eventually separating and appearing as differently colored 
bands on the paper. In the epistemological case, it is true and false hy¬ 
potheses that are mixed. We wish to separate the true ones from the false. 
Replication applies a "solvent" that diffuses false hypotheses towards
negative tallies and true hypotheses towards positive tallies. A true
hypothesis diffuses upwards with probability $(1 - \beta)c_{R+}$, while a
false hypothesis diffuses downwards with probability $(1 - \alpha)c_{R-}$.
Thus the communication parameters adjust rates of diffusion. Just as
manipulating rates of chemical diffusion can improve real chromatography,
manipulating communication can improve epistemological chromatography.
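
The four per-replication diffusion probabilities can be tabulated side by side. The two quoted in the text are the upward rate for true hypotheses and the downward rate for false ones; the other two are filled in here by the same logic and should be read as an inference from the model description. The function name is hypothetical.

```python
def diffusion_rates(power, alpha, c_rep_pos, c_rep_neg):
    """Probability that one replication moves a tally up or down."""
    return {
        "true_up": power * c_rep_pos,
        "true_down": (1.0 - power) * c_rep_neg,
        "false_up": alpha * c_rep_pos,
        "false_down": (1.0 - alpha) * c_rep_neg,
    }


# Full communication of replications versus suppressing 80% of positive ones.
for c_pos in (1.0, 0.2):
    print(c_pos, diffusion_rates(0.8, 0.05, c_pos, 1.0))
```

Lowering the communication of positive replications scales both upward rates by the same factor while leaving the downward rates untouched, which is how suppression changes the balance between upward and downward movement.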

In Fig. 3 we turn on communication one parameter at a time, in order
to explain the contribution of each mode of communication to the resulting
population dynamics. All four panels (a, b, c, d) show steady state
precision, sensitivity, and specificity and use b = 0.001, r = 0.2,
$1 - \beta = 0.8$, and $\alpha = 0.05$. These values are chosen for clarity
of illustration. In the Supporting Material, we provide a Mathematica
notebook to construct plots for any parameters the reader chooses. Note that
for sensitivity and specificity, probability above/below the highest/lowest
tally displayed is added up on the highest/lowest tally, so that none of the
probability mass is hidden.

FIGURE 3. Replication and communication as epistemological
chromatography. Precision is indicated in blue, sensitivity in orange,
and specificity in gray.

(a) Positive only. $c_{N-} = 0$, $c_{R-} = 0$, $c_{R+} = 1$. Only positive
findings are initially communicated, and replication can only increase
tallies, which are here counts of positive findings. True hypotheses
diffuse upward faster than false ones. So large tallies have a high
precision, the proportion of true hypotheses.

(b) Negative only. $c_{N-} = 0$, $c_{R-} = 1$, $c_{R+} = 0$. Tallies can
only decrease. False hypotheses diffuse down faster than true ones. But
since the mixture at tally +1 is mostly false, precision is always low.

(c) Screen and check. Solid: $c_{N-} = 0$, $c_{R-} = 1$, $c_{R+} = 1$. Up
diffusion of true hypotheses is aided by down diffusion of false ones from
the mixed source tally +1. Compare precision to (a). Dashed: $c_{N-} = 0$,
$c_{R-} = 1$, $c_{R+} = 0.2$. Suppressing positive replications regulates
the rate of up diffusion, purifying high tallies at the price of
sensitivity.

(d) Total communication. $c_{N-} = c_{R-} = c_{R+} = 1$. True and false
hypotheses diffuse in both directions, and everything is communicated.
Since most effort investigates new findings at tally -1, few hypotheses
ever achieve a high tally, but those that do have high precision.

In the first three panels (a, b, c), only positive initial findings are
communicated, and all new hypotheses appear at tally s = 1. The mixture of
hypotheses at this tally is heavily skewed towards false hypotheses, and
so has a low precision. Replication may cause an hypothesis to diffuse in
either direction, depending upon communication. In panel (a), negative
findings are never communicated. But since true hypotheses diffuse up at
a rate $1 - \beta$ and false ones only at a rate $\alpha < 1 - \beta$, truth
is slowly separated from falsity. At tallies of 8 or more, nearly all
hypotheses are true, as indicated by the precision. Note however that most
true hypotheses that have been communicated at all exist at low tallies, as
indicated by the sensitivity. With enough time and replication, every true
hypothesis can be split from the false. This is unlike the case in panel
(b), where only negative replications are communicated. The same dynamic
works in reverse here, and replication creates a pure sample of false
hypotheses at low tallies.

Combining both directions of diffusion is synergistic, as illustrated in 
panel (c). Now both positive and negative replications are communicated. 
The downward diffusion of false hypotheses makes the upward diffusion 
of true hypotheses more efficient. This effect arises because $1 - \alpha > 1 - \beta$.
False hypotheses diffuse down faster than true hypotheses diffuse up. This 
purifies the source mixture at s = 1, allowing for precision to approach 
high values at much smaller tallies than in the absence of either diffusion 
process. In this example, hypotheses with tallies of s = 3 and greater are 
true more than 80% of the time, and the sensitivity indicates that more than 
half of all published true hypotheses have a tally of 3 or more. Keep in 
mind that this 80% is equally interpretable as a probability that applies to 
a unique hypothesis. So it provides epistemic value, independent of the 
frequency interpretation. 

Diffusion in both directions is enhanced by suppressing some positive 
replications. The dashed curves in panel (c) provide a comparison when 
only 20% of positive replications are communicated. Precision is substan¬ 
tially higher in this case, but at the cost of reduced sensitivity at high tallies. 
This effect arises from the same dynamic as before: by setting $c_{R+} < 1$, we
have effectively slowed all upward diffusion. This allows rapid downward 
diffusion from negative replications to further clean the source mixture, but 
at the cost of diffusing more true hypotheses towards negative tallies. This 
dynamic is beneficial when base rate is especially low. So we achieve a very 
clean sample of truth at smaller positive tallies in this scenario, but at the 
price of finding fewer true hypotheses in total. Whether this is an improve¬ 
ment depends upon context, an issue we take up in the discussion. 

Finally, full communication is illustrated in panel (d). High precision is
achieved at high tallies, but few hypotheses reside at those tallies. This
inefficiency arises from the unbiased allocation of replication effort. When
all initial findings are communicated, replication effort is overwhelmed by
following up on initial negative findings, the spike in specificity seen at
tally $s = -1$. When the base rate is low, it can be better to screen for
positive findings than to publish every negative finding. Note however that
increasing precision, the proportion of hypotheses at a given tally that are
true, is not necessarily the only objective. It does us little good if
sensitivity is very low at all high tally values. We return to this point in
a later section, when we consider differential power and false-positive
rates between initial studies and replications.

Targeted replication. Replication in the preceding analysis is purely ran¬ 
dom: every communicated hypothesis has an equal chance of being the 
target of a replication effort. Targeting particular tally values, like s = 1, 
might be more efficient. Here, we demonstrate that the main effect of tar¬ 
geted replication is to improve sensitivity, the proportion of true hypotheses 
at positive tallies. It has little effect on precision, the proportion of hypothe¬ 
ses at positive tallies that are true. 

To modify the population dynamics to allow targeted replication effort, 
assume that a proportion rj of all replication attempts target a chosen list of 
tally values, selecting an hypothesis randomly from all hypotheses within 
the list. For example, this list might consist of all previously communicated 
hypotheses with a positive tally of three or less, so that researchers con¬ 
centrate their replication efforts on hypotheses thought to be true but with 
relatively high uncertainty. The rest of the time, 1 - rj, replication effort
remains unbiased. 
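
The targeting rule can be sketched as a two-step draw: with some probability pick from the target list, otherwise pick from all communicated hypotheses. The names below are hypothetical, `target_share` stands in for the targeting proportion defined in the text, and the fallback to an unbiased draw when no hypothesis currently sits in the target list is an assumption made here, since the text does not address that case.

```python
import random


def choose_replication_target(tallies, target_tallies, target_share, rng):
    """Pick the hypothesis to replicate.

    tallies        : dict mapping hypothesis id -> current tally
    target_tallies : set of tally values that targeted effort aims at
    target_share   : proportion of replication attempts that are targeted
    """
    ids = list(tallies)
    targeted = [h for h in ids if tallies[h] in target_tallies]
    # Assumption: if nothing is in the target list, fall back to unbiased.
    if targeted and rng.random() < target_share:
        return rng.choice(targeted)
    return rng.choice(ids)


rng = random.Random(0)
tallies = {"h1": 1, "h2": 5, "h3": -2, "h4": 2}
picks = [choose_replication_target(tallies, {1, 2, 3}, 0.5, rng)
         for _ in range(8)]
print(picks)
```

Targeting changes which hypotheses receive the next study, but not the power or false-positive rate of that study.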

Fig. 4 shows the resulting modification of the dynamics. The dashed
curves in these plots show the steady-state dynamics in the absence of tar¬ 
geting. The shaded pink regions show the range of tally values included in 
the target. In each case, targeting improves sensitivity at higher positive tal¬ 
lies. Thus it helps to diffuse true hypotheses towards tallies with very high 
precision. But there is very little effect on precision itself. Targeting helps 
because it directs effort towards tallies that may not have a high density of 
hypotheses. When replication effort is unbiased, most effort is directed to 
tallies where the bulk of hypotheses reside. Therefore when the target range 
includes a wide range, as in panel (c), it becomes relatively ineffective. 

Why doesn't targeting improve the proportion of hypotheses that are 
true at higher tallies? Targeting serves mainly to speed up diffusion, with¬ 
out altering the relative rates at which true and false hypotheses diffuse. 
Changes in communication rates, in contrast, do alter the differential rates 
of diffusion, and so may dramatically alter precision, as seen in the previous 
section. 

Differential power and false-positives. So far, we have assumed that power
$1 - \beta$ and false-positive rate $\alpha$ are the same in initial studies
and replications. Differences between initial studies and replications have
been at the center of concerns about replication [9]. Here we analyze a
version of our model in which we allow the power and false-positive rate to
vary. Let $1 - \beta_R$ and $\alpha_R$ be the power and false-positive rate,
respectively, for replications. What effects do both higher-powered
replication and lower-powered replication have on dynamics?
FIGURE 4. Targeted replication effort. In all three plots, tallies
marked for targeted replication are shown by the shaded region.
Precision is indicated in blue, sensitivity in orange, and specificity
in gray. Baseline parameters set to b = 0.001, $\alpha = 0.05$, r = 0.1,
rj = 0.5, $c_{N-} = 0$, $c_{R-} = c_{R+} = 1$. Dashed curves display
steady-state without targeted replication, rj = 0. (a) High power setting,
$1 - \beta = 0.8$. (b) Low power setting, $1 - \beta = 0.6$. (c) Low power,
$1 - \beta = 0.6$, and including tally s = 0 in the target.


In Fig. 5 we present two extreme, illustrative scenarios. Both scenarios
use b = 0.001, $c_{N-} = 0$, $c_{R-} = c_{R+} = 1$, r = 0.2, and rj = 0
unless noted otherwise. The first is a "low/high" scenario in which initial
findings are produced by studies with $1 - \beta = 0.6$ and $\alpha = 0.2$,
but replications have conventional $1 - \beta_R = 0.8$ and
$\alpha_R = 0.05$. This scenario reflects a context in which initial studies
use small samples and suffer from motivated data-snooping or data-contingent
analysis that elevates false-positives [22, 26]. This scenario is shown in
panel (a). The second scenario is a "high/low" scenario, with
$1 - \beta = 0.8$, $\alpha = 0.05$, $1 - \beta_R = 0.5$, $\alpha_R = 0.05$.
This scenario reflects a context in which replications are prone to error,
because a true effect requires skill to produce [9]. This scenario is shown
in panel (b).

Comparing the two, notice that low/high is more damaging overall, as
the elevated false-positives cascade through the population during
diffusion of hypotheses to higher tallies. Thus it takes more replication in
(a) to achieve the same precision as in the high/low scenario (b). Even with
only 50% power in (b), replication successfully separates true hypotheses
from false ones. Unfortunately, it also diffuses many true hypotheses
towards negative tallies. The high precision at positive tallies is a result
of a false hypothesis' relative inability to attain a positive replication,
not a result of a true hypothesis' ability to avoid a negative replication.

FIGURE 5. Differential power and replication dynamics. Precision is
indicated in blue, sensitivity in orange, and specificity in gray.
(a) Low power initial studies ($1 - \beta = 0.6$, $\alpha = 0.2$) but high
power replications ($1 - \beta_R = 0.8$, $\alpha_R = 0.05$). (b) High power
initial studies ($1 - \beta = 0.8$, $\alpha = 0.05$) but low power
replications ($1 - \beta_R = 0.5$, $\alpha_R = 0.05$). (c) and (d) as in (a)
and (b), respectively, but only 10% of negative replications are
communicated.

In the last two panels, (c) and (d), we show how these scenarios change
when negative replications are suppressed, $c_{R-} = 0.1$. The situation
generally worsens in both cases, but failure to communicate negative
replications does prevent true hypotheses from attaining negative tallies
in the case in which replication power is low, (d).

Overall, replications continue to have value, even when they are more 
prone to error than original studies. As long as true hypotheses are more 
likely to diffuse upwards than downwards, replication aids discovery. 

Discussion 

Ours is the first analytical model of the joint population dynamics of
scientific hypothesis generation, communication, and replication. Such a
model is necessary to illuminate debates about scientific practice, because
until researchers report the results of every study, empirical estimates of
base rate are not possible. And without consideration of population
dynamics, any discussion of the value of research findings remains at least
partly naive, because it is notoriously difficult to reason verbally about
complex systems. Our model produces a number of valuable counter-intuitive
results. But even when its results are intuitive, some model like ours is
needed to demonstrate their logic. It is not enough to merely hold the
correct belief; we must also justify that belief.

This model is not a definitive representation of the scientific process, nor
does it aim to be. It omits many relevant factors, such as investigator bias
and disagreements about the interpretation of evidence. These omissions
allow the model to address focused questions about the evidential value
of research as it emerges from the joint dynamics of hypothesis generation,
replication, and communication. Models that account for more and different
factors must also include variants of these complex dynamics, so our
model is a necessary and useful first step.

Our analysis re-emphasizes what every textbook says: replication is an
essential aspect of scientific discovery. However, it also quantifies its
impact and emphasizes that replication itself can be unreliable: the factors
that make initial findings unreliable also make replication less reliable.
When base rate is low, power is low, or false positives are common, many
successful replications will be needed to attain confidence in an
hypothesis. This is especially true when negative replications are difficult
to publish.
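
To make the scale of this requirement concrete, a minimal sketch follows. It
is not the population model analyzed here: it simply applies Bayes' rule to
one hypothesis that yields k consecutive positive findings, assuming
independent studies with the same power and false-positive rate and no
communication bias. The function name `posterior_true` is hypothetical,
introduced only for illustration.

```python
def posterior_true(b, power, alpha, k):
    """Probability a hypothesis is true after k consecutive positive findings.

    ASSUMPTION (not the full population model): studies are independent and
    share the same power and false-positive rate alpha.
    """
    true_path = b * power ** k          # true hypothesis, k true positives
    false_path = (1 - b) * alpha ** k   # false hypothesis, k false positives
    return true_path / (true_path + false_path)


if __name__ == "__main__":
    # Base rate and error rates borrowed from the baseline used in the figures.
    for k in range(1, 5):
        print(k, round(posterior_true(0.001, 0.8, 0.05, k), 3))
```

Under these simplifying assumptions, with base rate 0.001, power 0.8, and
false-positive rate 0.05, one positive finding leaves the probability of
truth near 0.016, and three consecutive positives are needed to exceed 0.8.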

We find that low base rate and high false positive rate are the most
important threats to the effectiveness of research, replicated or not. This
re-emphasizes the importance of quality theorizing, in order to improve base
rate. While it is appealing to think that science works regardless of where
hypotheses come from, undisciplined hypothesis generation reduces base
rate and makes initial findings mostly false. Then large amounts of
replication will be needed to uncover the truth. In fields such as physics
and evolutionary biology, a great deal can be and is done to vet theory in
the realm of pure thought, using mathematics and simulation. But in fields
such as social psychology, theory development is rarely formalized [27].

The results also re-emphasize the value of efforts to suppress false
positive findings, such as pre-registered data analysis plans. It is
important to recognize that any single scientific hypothesis may correspond
to many different statistical hypotheses. If a statistical hypothesis can be
chosen after seeing the data, reasonable scientific hypotheses can become
unreasonably flexible [28]. And many data-contingent transformations and
modeling choices that increase power, conditional on an hypothesis being
true, will also increase false-positives, conditional on the hypothesis being
false. For example, dropping outliers may well aid discovery, if the
hypothesis is true. But it may also dramatically inflate false-positives, if
the hypothesis is not true [29].

Our model immediately informs debates over the meaning of failed
replications. For example, some have suggested that positive replications
have more worth than negative replications [12], or even that failed
replications "cannot contribute to a cumulative understanding of scientific
phenomena" [30]. We find the opposite: communicating a failure to replicate
is typically more informative than communicating a successful replication.
This remains true even when replication attempts have lower power than
original studies. However, a single failure to replicate is entirely
consistent with a true hypothesis in many scenarios. So both positive and
negative replications may be regarded with skepticism. But neither is
without value. Of course our model is merely a model. But unlike the verbal
arguments we cite, it is at least clear in its assumptions, and its logic can
be verified.

Our model also sheds light on proposals for improving the reliability of
research. For example, many have called for pre-registration and review
with a commitment from journals to publish research results, positive or
negative, in order to reduce under-reporting of negative findings [31]. Our
analysis suggests that these proposals should distinguish between new
hypotheses and replication attempts. If indeed many new hypotheses are false
in many fields, a pre-registration process would merely fill journal pages
with null findings, doing great harm by crowding out candidate hypotheses
that have passed an initial screening. In our model, there is little harm in
ignoring novel negative findings, because they add very little information.
Indeed, Figure 2 illustrates that the effect of ignoring novel negative
results on precision is negligible. In contrast, a negative replication may
add a lot of information. We suspect however that our model exaggerates this
effect, because the model ignores the wasted effort arising from different
researchers repeating an investigation in ignorance of one another's
negative findings. And there are certainly fields in which full publication
may be the best policy, such as when false-positive rates are low or when
the total number of testable hypotheses is very small. Nevertheless, the
qualitative difference in information value between novel and follow-up
negative findings will remain as long as the base rate in the published
literature is higher than it is in novel investigations.

The model stimulates empirical investigation by clarifying which factors
must be estimated in order to gauge the evidential value of research, as well
as being readily translatable into a statistical framework, due to its
analytical specification. Our model provides an implicit 'null model' of
research: setting b = 0 provides a null distribution of novel findings and
lifespans of hypotheses. Null models are deliberately unrealistic and usually
a priori false, but have nevertheless played an important role in science
[20].

There are additional factors to address in future work. Our model ignores
researcher bias, multiple testing, and data snooping, each of which
deflates base rate or inflates false-positive rate. Our analysis is framed in
a standard, but unsatisfying, "true" and "false" classification, rather than
considering practical significance and effect size estimation [26]. Our model
can be directly generalized to consider variation in effect size instead of
true and false hypotheses. We explain this generalization in the Supporting
Material. However, our model does not directly address causal inference nor
point estimation.

Incentives also matter. A dynamic analysis of strategic behavior under
different incentive structures would aid policy analysis [18]. As Karl
Popper argued, science does not work because scientists are selfless and
unbiased people. Rather it works because its institutions channel our bias
into the production of public goods [32]. In particular, we worry that a
research environment that lacks replication may actually select for
statistical practices that inflate false-positives, as labs with such
practices can more readily publish findings and place students in new
positions, all while outrunning the truth.

Replication may offer other benefits that are not accounted for in our
model. A failed replication may be valuable because it inspires a new
hypothesis in order to explain variation in findings. When findings do not
generalize across samples, this creates an opportunity to explain the
variation [33, 34]. In our view, the goal of replication is not merely to
find the same result, but also to discover how a result arises and how it is
likely to vary in realistic, non-laboratory contexts.

Despite these shortcomings, our model provides specific quantitative 
evaluations of many verbal arguments, as well as drawing attention to the 
population dynamics of scientific knowledge. Science is a subtle project. 
Understanding it demands the same rigor that we apply to projects within 
science itself. 
SUPPLEMENTAL INFORMATION
Replication, Communication, and the Population Dynamics of Scientific
Discovery

1. Derivation of full model with random replication 

Let $f_{T,s} = n_{T,s}/n$ be the frequency of true hypotheses with tally $s$.
Under the assumptions and definitions supplied in the main text, the full
recursion for $n_{T,s}$ is given by:

$$n'_{T,s} = n_{T,s} + anr\left(-f_{T,s}\left(c_{R+}(1-\beta) + c_{R-}\beta\right) + f_{T,s-1}(1-\beta)c_{R+} + f_{T,s+1}\beta c_{R-}\right) \tag{5}$$

for $s$ not equal to $1$ or $-1$. In those cases, there is an additional term.
For $s = 1$:

$$n'_{T,1} = n_{T,1} + anr\left(-f_{T,1}\left(c_{R+}(1-\beta) + c_{R-}\beta\right) + f_{T,0}(1-\beta)c_{R+} + f_{T,2}\beta c_{R-}\right) + an(1-r)b(1-\beta) \tag{6}$$

The $an(1-r)b(1-\beta)$ term accounts for inflow of novel positive findings,
all of which are communicated. For $s = -1$:

$$n'_{T,-1} = n_{T,-1} + anr\left(-f_{T,-1}\left(c_{R+}(1-\beta) + c_{R-}\beta\right) + f_{T,-2}(1-\beta)c_{R+} + f_{T,0}\beta c_{R-}\right) + an(1-r)b\beta c_{N-} \tag{7}$$

The $an(1-r)b\beta c_{N-}$ term accounts for inflow of novel negative findings,
only a fraction $c_{N-}$ of which are communicated. Recursions for false
hypotheses can be derived just by substitution of variables: $b \to 1 - b$
and $1 - \beta \to \alpha$.

These recursions implicitly define the population growth recursion for $n$:

$$n' = n + an(1-r)\left(b(1 - \beta + \beta c_{N-}) + (1-b)(\alpha + (1-\alpha)c_{N-})\right) \tag{8}$$

This just indicates that the population of published hypotheses grows in
proportion to the innovation rate, $1 - r$, and the rates at which true and
false hypotheses respectively produce positive and negative findings, as
well as the rate at which negative findings are communicated.
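
The recursions above can be sketched in code as a single deterministic
update step. This is an illustrative sketch rather than the authors'
implementation: the function and argument names are hypothetical, tallies
are stored in dictionaries so that they remain unbounded, and `a` is passed
through as the coefficient written a in equations (5) to (8). The population
must be non-empty, since frequencies are counts divided by n.

```python
from collections import defaultdict


def step(n_true, n_false, a, r, b, power, alpha, c_n_neg, c_r_pos, c_r_neg):
    """Advance tally counts one time step using recursions (5)-(7).

    n_true and n_false map tally s -> number of published hypotheses.
    Tallies are unbounded here, so no boundary handling is needed.
    """
    n = sum(n_true.values()) + sum(n_false.values())

    def update(counts, base, p_pos):
        p_neg = 1 - p_pos
        new = defaultdict(float, counts)
        for s, count in counts.items():
            f = count / n  # frequency f_{i,s}
            up = a * n * r * f * p_pos * c_r_pos    # communicated positive replication
            down = a * n * r * f * p_neg * c_r_neg  # communicated negative replication
            new[s] -= up + down
            new[s + 1] += up
            new[s - 1] += down
        # Novel findings enter at s = +1 (always communicated) or at s = -1
        # (communicated only with probability c_n_neg).
        new[1] += a * n * (1 - r) * base * p_pos
        new[-1] += a * n * (1 - r) * base * p_neg * c_n_neg
        return dict(new)

    # False hypotheses follow by substituting b -> 1 - b and 1 - beta -> alpha.
    return update(n_true, b, power), update(n_false, 1 - b, alpha)
```

Because replication only moves hypotheses between tallies, summing the
returned counts reproduces the growth in n given by equation (8).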

2. Beyond "true" and "false" 

Above we noted that recursions for false hypotheses can be derived just
by substitution of variables: $b \to 1 - b$ and $1 - \beta \to \alpha$. In other
words, true and false hypotheses are differentiated only by the rate at which
they appear in new investigations and their respective probabilities of
producing positive findings. This also means it is straightforward to expand
the model to additional epistemic states, as "true" and "false" really just
mean more and less correct. For example, small, medium, and large effect
sizes could be represented by three states, each with its own base rate and
probability of producing a positive result. The derivation would remain the
same, but an additional set of steady-state solutions would appear.

3. Steady-state solutions 

We have analyzed this model using a variety of methods. First, we solved 
the model analytically for every structure except for targeted replication (to 
be defined later). Second, when analytical solution was not possible, we 
solved the model numerically. Third, we studied the model under both 
deterministic and stochastic simulations, written independently by both 
authors in different programming languages. All forms of analysis yield 
identical results. 

The model above can be solved directly, in one of two ways. First, it
can be solved exactly by bounding tallies within a minimum and maximum
(using either absorbing or reflecting boundaries) and then solving
the system of simultaneous equations for values of the state variables
$f_{i,s}$ for $i \in \{T, F\}$. This approach is probably the most
straightforward. Second, it can be solved to any level of approximation
desired by iteratively solving the system of equations outward from $s = 0$.

Both approaches yield solutions that take the form of closures of infinite
geometric series expressions. Using these solutions, we found the unbounded
infinite series solution based upon intuition (ansatz is what our
mathematics instructors used to call it). Since the solutions from the
brute-force approach looked like closures of infinite series, and the
simulation results produced what resembled a mixture of geometric series, we
guessed the underlying limiting distribution. We then verified our ansatz
solution by plugging it back into the recursions and also by comparing it to
numerical results and our previous solutions. Finally, we induced the
infinite series representation by constructing Taylor series expansions of
the closed series expressions, yielding the sequential terms of the solution
expression in the next section.

3.1. Full communication solution. Here we repeat the simplest such
solution from the main text and then motivate its justification. The steady
state proportion of hypotheses that are both true and have tally $s$, when all
findings are communicated, is given by:

$$p_{T,s} = b(1-r)\sum_{m=1}^{\infty} r^{m-1} K(m, (m+s)/2)\,(1-\beta)^{\frac{1}{2}(m+s)}\,\beta^{\frac{1}{2}(m-s)} \tag{9}$$

where $K(m, (m+s)/2)$ is the number of ways to get $(m+s)/2$ positive
findings in $m$ investigations of the same hypothesis. This is simply the
binomial chooser, but implicitly evaluating to zero whenever $(m+s)/2$ is not
an integer. Since $s$ is the difference between the number of positive and
negative findings, this multiplicity accounts for the number of paths by
which an hypothesis can be studied $m$ times and end up with a tally $s$. The
remaining terms leading with $1 - \beta$ and $\beta$ are just the probabilities
of getting $(m+s)/2$ positive findings and $(m-s)/2$ negative findings,
respectively.

Here's how to motivate the above solution. For any given tally $s$, there
are an infinite number of histories by which an hypothesis could have ended
up with that tally.

• Consider tally $s = 1$, for example. If the hypothesis is true, it could
end up most simply at $s = 1$ with just one initial positive finding.
This happens with probability $(1-r)b(1-\beta)$, indicating innovation
times base rate of true hypotheses times the probability of an initial
positive finding.

• Similarly, if instead the hypothesis has been studied twice, which
happens $(1-r)br$ of the time, the number of ways it could end up
with $s = 1$ is exactly zero, and the multiplicity handles this by
assigning $K(2, (2+1)/2) = 0$.

• For three studies, there are $K(3, 2) = 3$ ways $s = 1$ could happen.
Represented as sequences of positive and negative findings, these
are: (1) $++-$, (2) $+-+$, and (3) $-++$. The probability of any one
of these is $(1-\beta)^2\beta$, and the probability that an hypothesis is true
and has been studied three times is $(1-r)br^2$.

The pattern here generalizes so that the total probability is just:

• the sum over number of studies on an hypothesis from $m = 1$ to
$m = \infty$ of the probability the hypothesis was studied $m$ times, given
by $(1-r)r^{m-1}$

• times the number of ways it could end up with a tally $s$ in $m$ steps,
given by $K(m, (m+s)/2)$

• times the probability of getting $(m+s)/2$ positive and $(m-s)/2$
negative findings.

Writing down this summation and factoring out the common term $b(1-r)$
completes the expression.
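
The path-counting argument above can be checked numerically. The following
is a minimal sketch, assuming the infinite sum in equation (9) is truncated
at a finite number of studies `max_m`; the names `multiplicity` and `p_true`
are hypothetical and chosen only for this illustration.

```python
from math import comb


def multiplicity(m, k):
    """Binomial chooser K(m, k); zero when k is not an integer in [0, m]."""
    if k != int(k) or k < 0 or k > m:
        return 0
    return comb(m, int(k))


def p_true(s, b, r, beta, max_m=200):
    """Truncated steady-state sum of equation (9) for true hypotheses at tally s."""
    total = 0.0
    for m in range(1, max_m + 1):
        pos = (m + s) / 2  # number of positive findings
        neg = (m - s) / 2  # number of negative findings
        ways = multiplicity(m, pos)
        if ways:
            total += r ** (m - 1) * ways * (1 - beta) ** pos * beta ** neg
    return b * (1 - r) * total


if __name__ == "__main__":
    print(multiplicity(2, 1.5), multiplicity(3, 2))  # the two cases in the text
    print(p_true(1, b=0.001, r=0.2, beta=0.2))
```

Summed over all tallies, `p_true` returns approximately the base rate b, up
to truncation error of order r raised to the power `max_m`, because the
weights $(1-r)r^{m-1}$ and the binomial probabilities each sum to one.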

This steady-state solution obviously assumes that there has been an
infinite amount of research time, such that every $m$ can be realized. In
practice, since the sequence is geometric in $r$, the probabilities of higher
values of $m$ decline very rapidly and simulations confirm that steady-state
is reached quite rapidly, as long as the replication rate $r$ is not close to
1.

More importantly, we think, these solutions are never meant to describe
actual science, but rather to allow us to reason about causal forces in actual
science. So the steady state expressions are important even if, as in many
real dynamical systems, they are never exactly realized. For example,
problems in evolutionary theory are routinely solved by asking what happens
on the infinite time horizon. Such solutions have been incredibly useful,
despite the fact that no real population or environment is stationary enough
to make the exercise literally sensible.


3.2. Arbitrary communication solution. When communication parameters
are allowed to be less than one, the above strategy generalizes directly, but
the expressions become much more complex, because now the infinite series
is over multinomial probabilities of three possible outcomes at each
replication investigation of an hypothesis: (1) positive and communicated,
(2) negative and communicated, or (3) not communicated. In addition, when
findings are not always communicated, then the effective activity rate
changes, making other probabilities conditional on observable activity.
Still, these solutions can be derived either by the logic to follow or by
brute-force solution of the system of recursions. Solving the system of
recursions does allow for easily defining reflecting or absorbing tally
boundaries, which may be appealing in some contexts. The combinatoric
solution to follow assumes unbounded tallies. Solutions in the bounded and
unbounded cases are nearly identical, for all scenarios considered in the
main text. The Mathematica notebooks in the supplemental materials present
code for both types of solution.

We present the solutions here as a sequence of conditional probabilities,
as we've found this form easier to interpret, and therefore more insightful,
than the general multinomial form. Specifically, we decompose the
multinomial probabilities into a binomial series for observed/unobserved
investigations of a hypothesis and a binomial series for positive/negative
findings conditional on being observed. The solutions take the form:

$$p_{T,s} = \Pr(T)\,\Pr(\text{activity})\,\Pr(\text{new}\mid\text{activity})\left((1-\beta)\Pr(s\mid+) + \beta c_{N-}\Pr(s\mid-)\right) \tag{10}$$

where:

$$\Pr(T) = b \tag{11}$$

$$\Pr(\text{activity}) = r + (1-r)\left(b\left((1-\beta) + \beta c_{N-}\right) + (1-b)\left(\alpha + (1-\alpha)c_{N-}\right)\right) \tag{12}$$

$$\Pr(\text{new}\mid\text{activity}) = \frac{(1-r)\left(b\left((1-\beta) + \beta c_{N-}\right) + (1-b)\left(\alpha + (1-\alpha)c_{N-}\right)\right)}{\Pr(\text{activity})} \tag{13}$$

The probabilities $\Pr(s\mid+)$ and $\Pr(s\mid-)$ give the probabilities of tally
$s$ averaging over number of investigations $m$ and un-communicated findings
$u$, beginning with either a positive finding or a negative finding,
respectively.





This conditioning is necessary because a tally s can be reached by different 
paths once communication is partial. These probabilities are given by: 

$$\Pr(s\mid+) = I_1(s) + \sum_{m=1}^{\infty}\sum_{u=0}^{m} R^m \Pr(u\mid m)\,S(s-1\mid m-u) \tag{14}$$

$$\Pr(s\mid-) = I_{-1}(s) + \sum_{m=1}^{\infty}\sum_{u=0}^{m} R^m \Pr(u\mid m)\,S(s+1\mid m-u) \tag{15}$$

where $I_a(b)$ is a function that returns 1 when $a = b$ and zero otherwise
and $R = r/\Pr(\text{activity})$ is the probability of replication, conditional
on activity as defined earlier. The term $\Pr(u\mid m)$ gives the probability of
$u$ un-communicated findings in $m$ investigations, defined as:

$$\Pr(u\mid m) = K(m, u)\,q_0^{u}\,(1-q_0)^{m-u} \tag{16}$$

where

$$q_0 = (1-\beta_r)(1-c_{R+}) + \beta_r(1-c_{R-}) \tag{17}$$

is the probability a replication finding is un-communicated, averaging over
positive and negative findings. Finally, the function $S(z\mid n)$ provides the
probability that a sequence of $n$ communicated replication findings
produces a difference $z$ between positive and negative replications. It is
defined as:

$$S(z\mid n) = \begin{cases} I_0(z) & \text{if } n = 0 \\ K(n, (n+z)/2)\,q_+^{(n+z)/2}\,(1-q_+)^{(n-z)/2} & \text{if } n > 0 \end{cases} \tag{18}$$

where $K(a, b)$ is again the binomial chooser function, but evaluating to zero
when $b$ is not an integer, and:

$$q_+ = \frac{(1-\beta_r)c_{R+}}{1-q_0} \tag{19}$$

which is the probability of a positive replication, conditional on the
replication finding being communicated.
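
The decomposition in equations (10) to (19) can be sketched as follows. This
is a hedged illustration, not the Mathematica code referred to above: all
function and argument names are hypothetical, the infinite sums are truncated
at `max_m` replications, and $\Pr(u \mid m)$ is assumed to be binomial in
$q_0$ because equation (16) is not legible in the source. It also assumes
that at least some replication findings are communicated ($q_0 < 1$).

```python
from math import comb


def chooser(n, k):
    """Binomial chooser K(n, k); zero when k is not an integer in [0, n]."""
    if k != int(k) or k < 0 or k > n:
        return 0
    return comb(n, int(k))


def tally_prob(s, start, R, p_pos, c_r_pos, c_r_neg, max_m=60):
    """Pr(s | first finding), eqs. (14)-(15); start is +1 or -1.

    p_pos is the chance a replication is positive: 1 - beta_r for a true
    hypothesis, alpha_r for a false one.
    """
    q0 = p_pos * (1 - c_r_pos) + (1 - p_pos) * (1 - c_r_neg)  # eq. (17)
    q_plus = p_pos * c_r_pos / (1 - q0)                        # eq. (19)
    total = 1.0 if s == start else 0.0                         # indicator term
    z = s - start  # net change the communicated replications must produce
    for m in range(1, max_m + 1):
        for u in range(m + 1):
            n = m - u  # number of communicated replications
            # ASSUMPTION: Pr(u | m) is binomial in q0 (eq. (16) is illegible).
            pr_u = comb(m, u) * q0 ** u * (1 - q0) ** n
            if n == 0:
                seq = 1.0 if z == 0 else 0.0                   # eq. (18), n = 0
            else:
                k = (n + z) / 2
                ways = chooser(n, k)
                seq = ways * q_plus ** k * (1 - q_plus) ** (n - k) if ways else 0.0
            total += R ** m * pr_u * seq
    return total


def p_tally(s, base, p_init, p_rep, b, r, beta, alpha,
            c_n_neg, c_r_pos, c_r_neg, max_m=60):
    """Steady-state p_{i,s} from eq. (10).

    True hypotheses: base=b, p_init=1-beta, p_rep=1-beta_r.
    False hypotheses: base=1-b, p_init=alpha, p_rep=alpha_r.
    """
    novel = b * ((1 - beta) + beta * c_n_neg) + (1 - b) * (alpha + (1 - alpha) * c_n_neg)
    activity = r + (1 - r) * novel                   # eq. (12)
    new_given_activity = (1 - r) * novel / activity  # eq. (13)
    R = r / activity
    after_pos = tally_prob(s, +1, R, p_rep, c_r_pos, c_r_neg, max_m)
    after_neg = tally_prob(s, -1, R, p_rep, c_r_pos, c_r_neg, max_m)
    return base * activity * new_given_activity * (
        p_init * after_pos + (1 - p_init) * c_n_neg * after_neg
    )
```

Under the binomial assumption, setting all communication parameters to one
and using the same power for initial studies and replications recovers the
full communication solution of equation (9), which gives a useful check on
any implementation.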


4. Approximate conditions for reduced communication 

We argue in the main text that full communication is rarely optimal,
from the perspective of precision. Consider the full communication
context: $c_{N-} = c_{R-} = c_{R+} = 1$. For small $b$ ($b^2 \approx 0$) and
small $r$ ($r^3 \approx 0$), precision as defined in the main text is improved
by reducing communication parameters under the following conditions:

• $c_{N-} < 1$ when $\alpha < \beta$ (easy to satisfy)

• $c_{R-} < 1$ when $\alpha > 0.5$ (hopefully not satisfied)

• $c_{R+} < 1$ when $\beta - \alpha < 1/4$

These conditions are derived by first defining precision at $s = 1$, which is
the most conservative precision to investigate, because it benefits the
least from replication, and higher tallies always have higher precision than
$s = 1$. So improvements at $s = 1$ cascade upwards to higher tallies. Let
$\mathrm{PPV}_1$ be the precision at $s = 1$. Then the first condition is proved
by computing the derivative $\partial \mathrm{PPV}_1/\partial c_{N-}$, evaluated at
full communication parameter values. Then Taylor expand the result
simultaneously to second order around $r = 0$ and to first order around
$b = 0$. Neglecting terms of order $O(b^2)$ and $O(r^3)$ and higher:

9PPV 1 ^ — v 2 — —-9(j6 — «)(1 — (6 — a)(5 — 6a.) (20) 

9cn^ oc 

which is negative unless $\alpha > \beta$. Thus suppressing some initial negative
findings is favorable, provided the base rate is small and replication is not
too common. We think most scientific fields satisfy these conditions, but
reasonable people can and do disagree on that point.

In contrast, suppressing negative replications is unlikely to help. By the
same strategy, but this time differentiating with respect to $c_{R-}$:

l-(S-«)(l+2r((S-«)) (21) 

which is guaranteed positive, indicating that $c_{R-} = 1$ is favored, when
$\alpha < 0.5$, because by assumption $1 - \beta > \alpha$.

The third condition is derived similarly: 

~ -br 1 ^^-B-a)(l-4r(B-oc)) (22) 

9c r+ a. r 

The last term is the one in play. For the above to be negative, it is required
that:

$$r < \frac{1}{4}\,\frac{1}{\beta - \alpha} \tag{23}$$

And this is guaranteed when $\beta - \alpha < 1/4$.
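
As a compact restatement, the three approximate conditions listed at the
start of this section can be written as a small check. This is only a sketch
of the stated inequalities, valid under the same approximations (small base
rate, small replication rate, starting from full communication); the
function name and dictionary keys are hypothetical.

```python
def favored_reductions(alpha, beta):
    """Report which communication parameters the approximate conditions
    favor reducing below one, given false-positive rate alpha and
    false-negative rate beta.

    Valid only under the stated approximations: small b, small r, and
    starting from full communication.
    """
    return {
        "c_N-": alpha < beta,         # suppress some novel negative findings
        "c_R-": alpha > 0.5,          # suppress some negative replications
        "c_R+": beta - alpha < 0.25,  # suppress some positive replications
    }


if __name__ == "__main__":
    # Conventional error rates: alpha = 0.05 and power 0.8 (beta = 0.2).
    print(favored_reductions(0.05, 0.2))
```

With conventional values of alpha = 0.05 and beta = 0.2, the first and third
conditions hold and the second does not.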